# PersonaEval Supplementary Material

Thank you for reviewing our paper *"PersonaEval: Are LLM Evaluators Human Enough to Judge Role-Play?"*. This supplementary material lets you run five example cases from our benchmark, stored in `PersonaEval_example_cases.csv`.

## Setup

### Installation

Install the required dependencies:
```
pip install -r requirements.txt
```

## Usage

### Basic Usage

> Before running, specify your API key in `config.json`.

Run the benchmark with the default configuration:
```
python main.py
```

Or specify a custom configuration file:
```
python main.py custom_config.json
```

### Important Configuration Notes

1. **API Key**: Replace `"sk-your-api-key"` with your actual API key.

2. **Model**: Specify the model you want to evaluate. The default is `"gpt-4o"`.

3. **Cases**: Specify which cases to run using 1-based indices. For example, `[1, 2, 3]` will run the first three cases in the benchmark file.

4. **Reasoning Models**: If you're evaluating a reasoning model that uses a streaming API, set `"is_reasoning_model": true`.
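
To make the 1-based case indexing and the configuration keys concrete, here is a minimal sketch of how a configuration file could be read and how case numbers could be mapped to rows. It is an illustration, not the code in `main.py`: the helper names `load_config` and `select_cases` are hypothetical, and the `False` default for `is_reasoning_model` is an assumption (only the `"gpt-4o"` default is stated above).

```python
import json
from pathlib import Path

# Assumed defaults for illustration. Only the "gpt-4o" default is stated
# in this README; the is_reasoning_model default is a guess.
DEFAULTS = {
    "model": "gpt-4o",
    "is_reasoning_model": False,
}


def load_config(path):
    """Read a JSON config file and fill in missing keys from DEFAULTS."""
    user_config = json.loads(Path(path).read_text(encoding="utf-8"))
    return {**DEFAULTS, **user_config}


def select_cases(rows, case_numbers):
    """Return the rows named by 1-based case numbers, in the order given."""
    selected = []
    for number in case_numbers:
        if not 1 <= number <= len(rows):
            raise ValueError(f"case {number} is out of range 1..{len(rows)}")
        selected.append(rows[number - 1])  # convert 1-based to 0-based
    return selected
```

With five benchmark rows, `select_cases(rows, [1, 3, 5])` picks the first, third, and fifth rows, and a case number of `0` raises a `ValueError` rather than silently wrapping around to the last row.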

### Output

Benchmark results are saved as a CSV file in the configured output directory (default: `results`). The filename is based on the model name.
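
As a rough guide to where the file will appear, the sketch below builds a CSV path from a model name. The exact naming scheme used by `main.py` is not documented here, so the `result_path` helper and its character-replacement rule are assumptions; check the `results` directory after a run for the real filename.

```python
import re
from pathlib import Path


def result_path(model, output_dir="results"):
    """Build a CSV path for a model (hypothetical naming scheme)."""
    # Replace characters that are awkward in filenames, such as "/" in
    # provider-prefixed model names. This rule is an assumption.
    safe_name = re.sub(r"[^A-Za-z0-9._-]+", "_", model)
    return Path(output_dir) / f"{safe_name}.csv"
```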

Additionally, detailed results will be printed to the console, showing:
- The probability distribution across all options for each case
- The correct answer and model prediction
- Whether the prediction was correct or incorrect
- The total cost of running the cases
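
The relationship between the printed fields can be sketched as follows, assuming the prediction is the option with the highest probability. The function names and the option labels `"A"`, `"B"`, `"C"` are hypothetical; the actual console format may differ.

```python
def summarize_case(probabilities, correct_answer):
    """Derive the prediction and its correctness from option probabilities."""
    # Assumption: the predicted option is the one with the highest probability.
    prediction = max(probabilities, key=probabilities.get)
    return {
        "prediction": prediction,
        "correct_answer": correct_answer,
        "is_correct": prediction == correct_answer,
    }


def accuracy(summaries):
    """Fraction of cases whose prediction matched the correct answer."""
    if not summaries:
        return 0.0
    return sum(s["is_correct"] for s in summaries) / len(summaries)
```

For example, `summarize_case({"A": 0.1, "B": 0.7, "C": 0.2}, "B")` yields a prediction of `"B"` marked as correct.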

## Example

Here's an example of running the benchmark with a custom configuration:

1. Create a custom configuration file `my_config.json`:

```json
{
  "model": "gpt-4o",
  "api_key": "sk-your-actual-api-key",
  "cases": [1, 3, 5],
  "max_retries": 2
}
```

2. Run the cases:
```
python main.py my_config.json
```

## Troubleshooting

- **API Errors**: If you encounter API errors, check your API key and ensure you have access to the specified model.
- **Parsing Errors**: If the model response cannot be parsed, try adjusting the temperature or using a different model.
- **File Not Found**: Ensure the benchmark file exists in the specified location.
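
Two of these problems can be caught before any API call is made. The sketch below is an optional pre-flight check, not part of the provided scripts: `preflight_problems` is a hypothetical helper, and it assumes the benchmark file sits in the working directory under the name `PersonaEval_example_cases.csv`.

```python
from pathlib import Path

# Placeholder value mentioned in the configuration notes.
PLACEHOLDER_KEY = "sk-your-api-key"


def preflight_problems(config, benchmark_file="PersonaEval_example_cases.csv"):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    api_key = config.get("api_key", "")
    if not api_key or api_key == PLACEHOLDER_KEY:
        problems.append("api_key is missing or still the placeholder")
    if not Path(benchmark_file).is_file():
        problems.append(f"benchmark file not found: {benchmark_file}")
    return problems
```

An empty list means both checks passed; it does not guarantee that the key is valid or that your account has access to the chosen model.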